The game of Matching Pennies (MP), a simplified version of the more popular Rock, Paper, Scissors, 
schematically represents competitions between organisms with incentives to predict each other’s 
behavior. Optimal performance in iterated MP competitions involves the production of random choice 
patterns and the detection of nonrandomness in the opponent’s choices. The purpose of this study was 
to replicate systematic deviations from optimal choice observed in humans when playing MP, and to 
establish whether suboptimal performance was better described by a modified linear learning model or 
by a more cognitively sophisticated reinforcement-tracking model. Two pairs of pigeons played iterated 
MP competitions; payoffs for successful choices (e.g., “Rock” vs. “Scissors”) varied within experimental 
sessions and across experimental conditions, and were signaled by visual stimuli. Pigeons’ behavior 
adjusted to payoff matrices; divergences from optimal play were analogous to those usually 
demonstrated by humans, except for the tendency of pigeons to persist on prior choices. Suboptimal 
play was well characterized by a linear learning model of the kind widely used to describe human 
performance. This linear learning model may thus serve as a default account of competitive performance 
against which the imputation of cognitively sophisticated processes can be evaluated. 
Psychology, economics, and ethology are 
concerned with how human and nonhuman 
organisms choose between alternatives in the 
context of scarce resources. Choices made by 
one organism are often linked to those made 
by others, such that obtaining a resource 
depends on what others do. In competitive 
scenarios, where one’s gain is another’s 
loss, the chances of obtaining a preferred 
resource are greatly enhanced if an oppo- 
nent’s choices are anticipated. Consequently, 
natural selection condemns predictable behav- 
ior and favors the detection of nonrandom- 
ness among competitors (Miller, 1997). In 
predator-prey encounters, for instance, un- 
predictable motor behavior is afforded by 
inheritable morphological symmetries and 
sensory-motor systems that allow fast changes 
of speed and direction (Driver & Humphries, 
1988). It is unclear, however, to what extent 
individual animals may learn unpredictable 
behaviors by repeated exposure to 
predator/prey-associated stimuli, and whether such a 
learning process is general or niche specific. 

Mixed strategy games generically describe 
the kind of competitive situations considered 
here, where unpredictability is encouraged 
(Camerer, 2003). This class of interaction is 
characteristic of the relation between cheetahs 
and gazelles, goalkeepers and penalty-kickers, 
cops and robbers, and any other pair of agents 
where a pursuer must predict the actions of an 
evasive opponent. Representing competitions 
as mixed strategy games permits the specifica- 
tion of normative (Nash) equilibria in the 
distribution of choices made by two oppo- 
nents. Consider the Rock, Paper, Scissors game. 
If a player chose to always play one alternative 
(rock, paper, or scissors), she would easily be 
defeated by an opponent who always made the 
complementary move (paper, scissors, or 
rock). Thus, it is not optimal to pick a single 
alternative unless the opponent picks a single 
alternative. It is easy to see how, if both players 
understand the rules of the game and maxi- 
mize their chances to win, their choice strategy 
will converge at selecting any alternative 
randomly with p = 1/3; any deviation from this 
strategy could be exploited by the opponent 
and reduce one’s chances to win. When each 
player chooses between rock, paper, and 
scissors with p = 1/3, the Nash equilibrium of 
the game is achieved. 
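
The exploitability argument can be checked numerically. The following is a minimal sketch, not part of the original study: it assumes a scoring of +1 for a win, -1 for a loss, and 0 for a tie, and the names `payoff`, `expected_payoff`, and `best_response` are hypothetical helpers introduced only for illustration.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
# BEATS[x] is the move that x defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """Assumed scoring: +1 for a win, -1 for a loss, 0 for a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_payoff(move, opponent_mix):
    """Expected payoff of a pure move against a mixed strategy.

    `opponent_mix` maps each move to the probability that the opponent plays it.
    """
    return sum(prob * payoff(move, other) for other, prob in opponent_mix.items())


def best_response(opponent_mix):
    """Return the pure move with the highest expected payoff, and that payoff."""
    best = max(MOVES, key=lambda m: expected_payoff(m, opponent_mix))
    return best, expected_payoff(best, opponent_mix)


# Against a uniform opponent every move has the same expected payoff (zero),
# so there is nothing to exploit.
uniform = {m: Fraction(1, 3) for m in MOVES}

# An opponent who overplays rock can be exploited by playing paper.
biased = {"rock": Fraction(1, 2), "paper": Fraction(1, 4), "scissors": Fraction(1, 4)}
```

Under these assumptions, `best_response(biased)` selects paper with a positive expected payoff, whereas no move earns more than zero against `uniform`.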

When investigating actual behavior in mixed 
strategy games, there are two important 
considerations regarding Nash equilibria. 
First, perfect randomization is not necessary 
for optimal performance — as long as the 
opponent cannot predict a player’s move, the 
player’s chances to win are as high as they can 
be. Although randomization is optimal regard- 
less of the opponent’s competence (and thus 
will serve here as criterion of optimality), 
mixed strategy games encourage only unpre- 
dictable behavior. Second, Nash equilibria 
specify what players would learn if they were 
maximizing payoffs, but not how they would 
learn it. In this paper, we are concerned with 
whether Nash equilibria are attained by real 
players in a mixed strategy game, and how they 
learn to attain them. 

Experimental studies on how individual 
living organisms actually learn mixed strategies 
have been focused on human players (for a 
review, see Camerer, 2003), with only a few 
studies on other species, mostly primates 
(Dorris & Glimcher, 2004; Flood, Lenden- 
mann & Rapoport, 1983; Lee, Conroy, 
McGreevy & Barraclough, 2004; Lee, 
McGreevy, & Barraclough, 2005). Research 
on human randomization suggests systematic 
suboptimal biases, but quasirandom sequences 
may be learned (Neuringer, 1986). Nonhu- 
man organisms have also shown learned 
randomization (Machado, 1989). Although 
the demands of repeated mixed strategy games 
may enhance randomization in humans (Ra- 
poport & Budescu, 1992), subjects with exten- 
sive training systematically fail to produce 
serially independent choice sequences: Hu- 
mans tend to overalternate (Brown & Ro- 
senthal, 1990; Towse & Maclachlan, 1999), 
whereas rhesus monkeys (Macaca mulatta) 
tend to perseverate (Lee et al., 2004). The 
capacity to detect nonrandom patterns in 
humans appears to be limited by memory 
and other computational constraints (Falk & 
Konold, 1997); comparable data are not yet 
available from nonhuman subjects. Without a 
wider base of studies involving nonhuman 
subjects, it is difficult to establish whether 
Fig. 1. Schematic representation of MP competition. 
Each of two players (Same and Different) has a choice 
between heads and tails. Same wins if both players make the 
same choice; Different wins if players make different choices. 

sophisticated cognitive mechanisms, such as 
top-down executive functions, are necessary 
for the acquisition of novel and optimal 
competitive behavior. 

We examined how pigeons (Columba livia) 
learn to compete against a conspecific in a 
mixed strategy game known as Matching 
Pennies (MP), a two-choice version of Rock, 
Paper, Scissors. Matching Pennies involves two 
players, each with a penny that can be played 
heads or tails and an assigned role as Same or 
Different. If both players play the same side of 
the coin, Same keeps both coins; if each player 
plays a different side of the coin, Different keeps 
both coins (Figure 1). As in Rock, Paper, 
Scissors, optimal play in iterated MP competi- 
tions involves producing unpredictable se- 
quences of choices and detecting nonrandom- 
ness in the opponent’s choices. This is also the 
Nash equilibrium of MP: Against a choice- 
randomizing opponent, the best strategy is to 
randomize one’s own choices. 

We evaluated performance of pigeons in MP 
against two predictions derived from optimal- 
ity: (1) choices are serially independent, and 
(2) in each game, choices stabilize at the Nash 
equilibrium. Then, we compared two learning 
models as accounts for the performance of 
pigeons and their deviations from optimality. 
Finally, we validated the best learning model 
by using its parameter estimates to simulate 
MP performance. 
METHOD 

Subjects 

Four adult pigeons (Columba livia) were 
housed individually in a room with a 12:12-hr 
day:night cycle, with dawn at 0600 hr. They 
had free access to water and grit in their home 
cages. The pigeons’ running weights were 
based on 80% of their free-feeding weights. 
Each pigeon was weighed immediately prior to 
an experimental session and was excluded 
from a session if its weight exceeded its 
running weight by more than 8%. When required, a supplementary 
feeding of ACE-HI pigeon pellets (Star 
Milling Co.) was given at the end of each day, 
at least 12 hr before experimental sessions 
were conducted. Supplementary feeding 
amounts were equal to 50% of the average 
amount fed over the last 15 days, plus 50% of 
the current deviation from target running 
weight. 

Apparatus 

Experimental sessions were conducted in 
four modular test chambers (305 mm long, 
241 mm wide, and 292 mm high), each 
enclosed in a sound- and light-attenuating 
box equipped with a ventilating fan. The floor 
consisted of thin metal bars positioned above a 
catch pan. The front and rear walls and the 
ceiling of the experimental chambers were 
made of clear plastic, and the front wall was 
hinged and functioned as a door to the 
chamber. One of the two aluminum side 
panels served as a test panel. The test panel 
contained three plastic transparent response 
keys (25 mm in diameter) aligned horizontal- 
ly, 70 mm from the ceiling. The keys could be 
illuminated by white, green and red light 
emitted from two diodes located behind the 
keys. A rectangular opening (52 mm wide, 
57 mm high) located 20 mm above the floor 
and centered on the test panel could provide 
access to milo (grain sorghum) when a grain 
hopper behind the panel was activated. A 
house light was mounted 12 mm from the 
ceiling on the side wall opposite the test panel. 
The ventilation fan mounted on the rear wall 
of the sound-attenuating chamber provided 
masking noise of 60 dB. Experimental events 
were arranged via a Med-PC® interface con- 
nected to a PC controlled by Med-PC IV® 
software. 
Fig. 2. Choice procedure for unbiasing protocol and 
MP game. A peck on the center key (illuminated according 
to block or role) illuminated both side keys. A peck on a 
side key constituted a choice (left = heads, right = tails); it 
changed key color to red and extinguished the opposite 
side key before delivering food or a blackout. 


Procedure 

Unbiasing protocol. Before starting the ex- 
periment proper, we sought to reduce any bias 
toward pecking either choice key when illumi- 
nated with colors used during the experiment. 
Each of six daily sessions consisted of one 
block of consecutive “green” trials, and one 
block of consecutive “white” trials, presented 
in random order. Each trial — diagrammed in 
Figure 2 — began with the darkening of the 
house light and the illumination of the middle 
key with the assigned color (green or white). A 
key peck extinguished the middle key and 
illuminated both choice keys with the assigned 
color. A key peck on either choice key 
changed its color to red and extinguished 
the opposite key. After a 2-s delay, the red-colored 
key was also extinguished and food, 
which served as the reinforcer, was presented 
probabilistically for 2.5 s. The probability of 
reinforcement of each choice was approxi- 
mately the fraction of total pecks made on the 
opposite key. A 2.5-s blackout was presented 
when reinforcement was absent. The house 
light was illuminated between trials for 2.5 s. 
“Color” blocks alternated after 30 feedings; 
key colors were not related to reinforcement 
probability. Sessions ended after 60 feedings 
or 90 min, whichever happened first. 
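
The reinforcement rule of the unbiasing protocol can be sketched as follows. This is an illustrative reading of the rule stated above, not the Med-PC program used in the experiment; the function name, the key labels, and the value returned before any peck has been recorded are assumptions.

```python
def reinforcement_probability(choice, left_pecks, right_pecks):
    """Approximate probability that `choice` ("left" or "right") is reinforced.

    Following the rule described in the text, the probability is the fraction
    of all choice pecks so far that were made on the OPPOSITE key, so the
    currently less-preferred key is reinforced more often.
    """
    total = left_pecks + right_pecks
    if total == 0:
        return 0.5  # assumption: no peck history yet, so treat both keys alike
    opposite = right_pecks if choice == "left" else left_pecks
    return opposite / total
```

For a pigeon that has pecked left 30 times and right 10 times, this sketch reinforces a left choice with probability .25 and a right choice with probability .75, which pushes choice back toward indifference.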




FEDERICO SANABRIA and ERIC THRAILKILL 


Table 1 

Number of sessions and duration of reinforcers in each game. 

                Number of Sessions              Duration of Reinforcers
                                            Pigeon A            Pigeon B
Game    Pigeons A1, B1    Pigeons A2, B2    Heads    Tails      Heads    Tails
1             21                20          short    short      short    short
2             32                35          short    LONG       short    short
3             33                47          short    LONG       short    LONG
4             21                29          LONG     short      LONG     short

Note. “Short” reinforcers were 2 or 2.5 s of access to food. “LONG” reinforcers were 7.5 s of access to food. 
Unreinforced choices were followed by a 2.5-s blackout. 


Matching Pennies (MP) game. After complet- 
ing the unbiasing protocol, pigeons were 
paired without regard to preexperimental 
performance, and played a series of MP games 
exclusively against each other. Choices in 
these games were arranged similarly to those 
in the unbiasing protocol (Figure 2). Sessions 
were divided into two parts. During the first half 
of each session, one pigeon played the role of 
Same, where reinforcers were delivered only 
after the subject pecked on the same side as 
its opponent. Meanwhile, the other pigeon 
played Different, and reinforcers were delivered 
only after it pecked on the side opposite to 
the one its opponent chose. Roles were assigned randomly 
at the beginning of the session and were 
signaled by the color of both keys. The size of 
each reinforcer (the payoff) was constant 
within sessions, but varied across experimental 
conditions, ranging between 2 and 7.5 s of access 
to food. The first half of each session ended 
once any 1 of the 2 pigeons had accumulated 
at least 80 s of access to food over the session. 
During the second half of the session, roles 
were reversed. The second half of the session 
ended when 1 of the 2 pigeons had accumu- 
lated at least 160 s of food access over the 
session. 

Before the beginning of each trial, the 
house light was turned on for 10 to 12 s. The 
variability in intertrial interval was used to 
synchronize the beginning of each trial across 
pigeons. Each trial began with the extinction 
of the house light simultaneously with the 
illumination of the middle key on the 
response panel. The middle key was illuminat- 
ed green, when playing Same, or white, when 
playing Different. A peck on the middle key 
extinguished it and illuminated both left and 
right choice keys (heads and tails, respectively) 


according to role. A peck on either choice key 
turned the pecked key red and extinguished 
the key not pecked. The chosen key remained 
red until 2 s after the opponent made its 
choice, then it was turned off and the 
reinforcer was delivered if it was due; if it was 
not due, a 2.5-s blackout was presented 
instead. If a choice was not made within 10 s 
from the beginning of the trial, the key lights 
were turned off, the choice for that round was 
marked as a forfeit, and a 2.5-s blackout 
ensued. When the opponent forfeited, either 
choice was reinforced. Forfeits, which repre- 
sented 1% of the trials, were excluded from 
analysis and thus heads and tails choices added 
to 100%. Reinforcement and blackouts 
marked the end of each trial. 
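
The outcome of a single MP trial, including forfeits, can be summarized in a short sketch. It is offered only as an illustration of the contingencies described above, not as the experimental control code; the function name and the use of `None` to mark a forfeit are assumptions.

```python
def resolve_trial(same_choice, different_choice):
    """Return (same_reinforced, different_reinforced) for one trial.

    Each choice is "heads", "tails", or None when that player forfeited
    (no choice within the time limit).
    """
    same_forfeit = same_choice is None
    different_forfeit = different_choice is None

    # A forfeiting player always receives a blackout; if only the opponent
    # forfeited, either choice made by the other player is reinforced.
    if same_forfeit or different_forfeit:
        return (not same_forfeit and different_forfeit,
                not different_forfeit and same_forfeit)

    matched = same_choice == different_choice
    # Same is reinforced on matching choices, Different on mismatching ones.
    return (matched, not matched)
```

Exactly one player is reinforced on every trial in which at least one choice is made, which is what makes the game strictly competitive.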

Payoffs were varied from game to game by 
changing the duration of reinforcers. Table 1 
indicates the number of sessions played and 
the duration of reinforcers in each game. In 
Game 1, the durations of heads and tails 
reinforcers were identical; in Game 2, the 
duration of tails reinforcers was increased only 
for one pigeon (A) in each pair; in Game 3, 
the duration of tails reinforcers was increased 
also for pigeon B in each pair; in Game 4, the 
durations of tails and heads reinforcers were 
switched for both pigeons in each pair. 
Unreinforced choices were always followed by 
a 2.5-s chamber blackout. Note that assign- 
ment of pigeons as A or B was not related to 
role assignment: In each session, both pigeons, 
A and B, played Same and Different. 

RESULTS 

Serial Independence and Choice Perseveration 

Optimal choices in iterated Matching Pen- 
nies games are unpredictable, and thus should 
Fig. 3. Probability of choosing the same alternative 1 
to 10 trials in the future, in each of the four games. 
Probability of repeating a choice was calculated as y = 
½ p(heads in lag 0 | heads in lag x) + ½ p(tails in lag 0 | tails in lag 
x). Fitted curves trace exponential decay functions of the 
form $y = y_0 + a e^{-bx}$, with parameters varying across games. 
Fitting was conducted using SigmaPlot 2004 for 
Windows 9.01. 


be independent of prior choices. To evaluate 
this expectation, we examined the probability 
of repeating a choice in future trials. In an 
unpredictable sequence of choices, the prob- 
ability of repeating the same choice after any 
number of trials should be .5. To the extent 
that this probability is above .5, it demonstrates 
perseveration in choice; to the extent that it is 
below .5, it demonstrates overalternation. This 
measure is insensitive to preference towards an 
alternative: If a player prefers playing heads 
over tails, the probability of repeating a heads 
choice would be over .5, but the probability of 
repeating a tails choice would be below .5, 
averaging at .5 if there are no perseveration 
or overalternation trends. 
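
A minimal sketch of this measure is given below. It assumes that choices are coded as the labels "H" and "T" and that the two conditional probabilities are averaged with equal weight, as the description above implies; the function name is hypothetical.

```python
def repeat_probability(choices, lag):
    """Average probability that a choice is repeated `lag` trials later.

    The conditional probabilities of repeating heads and of repeating tails
    are computed separately and then averaged, so that a plain preference
    for one side does not by itself move the measure away from .5.
    """
    pairs = list(zip(choices, choices[lag:]))  # (choice at t, choice at t + lag)
    conditionals = []
    for side in ("H", "T"):
        later = [b for a, b in pairs if a == side]
        if not later:
            raise ValueError(f"no {side} choice is followed by a trial at lag {lag}")
        conditionals.append(sum(b == side for b in later) / len(later))
    return sum(conditionals) / 2
```

A strictly alternating sequence yields 0 at lag 1 (pure overalternation) and 1 at lag 2, whereas values above .5 at short lags indicate perseveration.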

Figure 3 shows the probability that choice in 
any trial t would be repeated in trial t + x, with 
x ranging between 1 and 10. The probability of 
repeating a choice was above .5 in all games, 
regardless of lag, which indicates a systematic 
positive influence of past choices on current 
choices in pigeons. This influence appears to 
decay exponentially over time, faster in Games 
1, 2, and 3 than in Game 4, as shown by the 
fitted curves in Figure 3. We may therefore 
represent choice on any given trial as a 
function of an exponentially weighted moving 
average (EWMA) of prior choices, 

$$p'_{t+1} = \alpha C_t + (1 - \alpha) p'_t, \quad 0 < \alpha < 1 \qquad (1)$$ 


where $p'_t$ is the predicted probability of a 
choice in trial t. For clarity, it will be assumed 
throughout this paper that the variable of 
interest is the probability of heads choices; our 
inferences would apply also to tails choices, 
because their probability complements the 
probability of heads choices. Thus, $p'_t$ is the 
probability of heads in trial t; $C_t$ is 1 if heads 
was chosen in t, or zero if tails was chosen in t; 
$\alpha$, the only free parameter, is the rate of decay 
of the moving average (1 = no decay, 0 = 
immediate decay). To illustrate how $\alpha$ was 
estimated, consider the following example: 
Suppose a choice of heads in trial 99 was 
expected with $p'_{99} = .8$, but the actual choice 
was tails ($C_{99} = 0$). If $\alpha = .5$, the expected 
choice of heads for trial 100 would be $p'_{100} = 
.5 \times 0 + .5 \times .8 = .4$; if $\alpha = .1$, $p'_{100} = .1 \times 0 + 
.9 \times .8 = .72$. Higher values of $\alpha$ indicate that 
current choices are more predictive of subsequent 
choices. Fitting $\alpha$ to each pigeon’s 
choices across all four games, using the 
method of maximum likelihood,¹ yielded a 
median of .074, ranging between .039 and .103 
(see EWMA in Table 2). Thus, the typical 
influence of a choice on the subsequent choice 
was systematic but relatively small, and it 
decayed to about half (3.7%) between t + 1 and t 
+ 10 (i.e., $\alpha(1 - \alpha)^{10-1} \approx 0.5\,\alpha(1 - \alpha)^{1-1}$). 
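
The EWMA update of Equation 1 and its fit by maximum likelihood (Equation F1 in the footnote) can be sketched as follows. This is an illustration under stated assumptions, not the fitting routine used in the study: the starting prediction `p0 = 0.5`, the grid search over the decay parameter, and the small floor placed on the likelihood are all choices made here for simplicity.

```python
import math


def ewma_update(p_prev, choice, alpha):
    """Equation 1: next predicted probability of heads.

    `choice` is 1 for heads and 0 for tails; `alpha` lies strictly in (0, 1).
    """
    return alpha * choice + (1 - alpha) * p_prev


def ewma_log_likelihood(choices, alpha, p0=0.5):
    """Sum over trials of log[C_t p'_t + (1 - C_t)(1 - p'_t)]."""
    p = p0  # assumption: the first trial is predicted at indifference
    total = 0.0
    for c in choices:
        likelihood = c * p + (1 - c) * (1 - p)
        # Floor guards against log(0) when rounding pushes p to exactly 0 or 1.
        total += math.log(max(likelihood, 1e-12))
        p = ewma_update(p, c, alpha)
    return total


def fit_alpha(choices, grid_size=999):
    """Grid-search estimate of alpha; a stand-in for a numerical optimizer."""
    candidates = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(candidates, key=lambda a: ewma_log_likelihood(choices, a))
```

With these definitions, `ewma_update(0.8, 0, 0.5)` and `ewma_update(0.8, 0, 0.1)` reproduce the worked example in the text (.4 and .72), and `(1 - 0.074) ** 9` is close to .5, which is the halving of influence between lags 1 and 10 noted above.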

Nash Equilibrium and Sensitivity to Own Payoffs 

The second prediction from optimal iterat- 
ed play of Matching Pennies is that choices 
stabilize near the Nash equilibrium. The Nash 
equilibrium is the probability of choosing 
heads such that neither player has an incentive 
to change how its own choices are allocated. 
The incentive to change choice distribution is 
absent only when the expected utility (EU) 
obtained from choosing heads or tails is the 
same. To estimate the stable allocation of 
choices predicted by Nash equilibrium, it may 
be assumed that the EU of choosing an 
alternative is the product of the utility of 
reinforcement (which we expected to covary 
positively with duration of reinforcers) and the 

¹ Maximum likelihood was computed by varying free 
parameters to fit predictions ($p'_t$) to data ($C_t$) on each 
trial. Fitting was conducted by maximizing 

$$\sum_t \log\left[C_t p'_t + (1 - C_t)(1 - p'_t)\right] \qquad (F1)$$ 

which yields the log-likelihood of the model, the log 
probability of the data given the best estimates of model 
parameters. 
Table 2 

Parameters of EWMA (Eq. 1), SLLp, and WRM. 

Model    Parameter    A1       A2       B1       B2
EWMA     α            .039     .103     .079     .069
SLLp     α            .110     .241     .175     .041
         γ            .032     .099     .020     .010
         λ_E          .077     .201     .090     .011
         LLR*         111.1    265.4    130.6    137.7
WRM      CLm          .126     .119     .005     .017
         s            .308     .404     .632     .666
         pHFyjIS      0.748    1.16     1.12     1.18
         LLR*         101.6    135.0    75.3     133.2

* LLRs (log-likelihood ratios) are computed as L(model) 
- L(EWMA), where L(Y) is the log-likelihood of model Y. 


probability of reinforcement. Probability of 
reinforcement depended on the opponent’s 
choices and on the role being played: For example, 
the probability that a heads choice will be 
reinforced while playing Same is the probability 
that the opponent will also choose heads. 
Thus, we can express the EU of choosing heads 
while playing Same as $EU_S(H) = p_D U_S(H)$, 
where $p_D$ is the probability that the opponent 
(who is playing Different) will choose heads, 
and $U_S(H)$ is the utility of a reinforcer 
obtained for choosing heads. The subscripts 
in this equation indicate the player to whom 
each variable is ascribed: The utility of 
reinforcement and the expected utility correspond 
to one player (S in our example), 
whereas the probability of choosing heads 
corresponds to the opponent (D in our 
example). At the Nash equilibrium, when 
the EUs of playing heads and tails are equal, we 
may solve for the opponent’s p, and this value 
may be contrasted against data to verify that 
choices converged at the Nash equilibrium. To 
solve for p in each role, a more general 
algebraic statement of the Nash equilibria 
must be made: 

$$\begin{aligned} EU_D(H) &= EU_D(T) = (1 - p_S)\,U_D(H) = p_S\,U_D(T) \\ EU_S(H) &= EU_S(T) = p_D\,U_S(H) = (1 - p_D)\,U_S(T) \end{aligned} \qquad (2)$$ 

The first line of Equation 2 may be read as 
“while playing Different, the expected utility of 
choosing heads or tails is the same, which 
means that the probability that the opponent 
will choose tails ($1 - p_S$), times the utility of a 
reinforced heads choice, is equal to the 


probability that the opponent will choose 
heads times the utility of a reinforced tails 
choice.” The second line may be read analogously. 
From these equations $p_S$ and $p_D$ may be 
solved: 

$$p_S = \frac{U_D(H)}{U_D(T) + U_D(H)}, \qquad p_D = 1 - \frac{U_S(H)}{U_S(T) + U_S(H)} \qquad (3)$$ 
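
Equation 3 is easy to evaluate directly. The sketch below is illustrative only: the function and argument names are hypothetical, and the utilities are passed in arbitrary units rather than as the fitted values reported in the paper.

```python
def nash_equilibrium(u_same_heads, u_same_tails, u_diff_heads, u_diff_tails):
    """Return (p_S, p_D), the equilibrium probabilities of choosing heads.

    p_S applies to the player in the Same role and depends only on the
    utilities of the Different player; p_D applies to the player in the
    Different role and depends only on the utilities of the Same player.
    """
    p_s = u_diff_heads / (u_diff_tails + u_diff_heads)
    p_d = 1 - u_same_heads / (u_same_tails + u_same_heads)
    return p_s, p_d
```

For example, `nash_equilibrium(1, 1, 1, 3)` returns `(0.25, 0.5)`: tripling the utility of tails for the Different player changes the equilibrium of its opponent, who should now choose heads only a quarter of the time, while the Different player itself should remain at .5. The indifference condition of Equation 2 holds, because (1 - .25) × 1 = .25 × 3.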

To determine whether choices converged 
on Nash equilibria, or if there were any 
systematic deviations from this normative 
prediction, $p_S$, $p_D$, and utilities for each pigeon 
were fitted to the proportions of heads choices 
during the last 5 sessions of each game (i.e., 
after the pigeons had at least 15 sessions of 
experience). To the extent that $p_S$ and $p_D$ 
successfully tracked changes in heads choices, 
Nash equilibrium would be validated as a 
behavioral predictor. There were two con- 
straints to the estimation of utilities in the 
Nash equilibrium model: (1) Utilities did not 
change across roles or games within the same 
pigeon — for example, the utility of a 2.5-s 
reinforcer did not change for a pigeon 
regardless of whether it was playing Same, 
Different, Game 1, or Game 4; and (2) utilities 
did not covary negatively with duration of 
reinforcement within the same pigeon — for 
example, the utility of a 2.5-s reinforcer was 
equal to or less than the utility of a 7.5-s 
reinforcer. This implied two free parameters 
in the Nash equilibrium model fitted to each 
pigeon, one for the utility of each of the two 
reinforcer durations used. A third parameter 
was included to account for payoff-independent 
biases in choice; for example, $U_S(H)$ was 
the utility of the reinforcer of successful heads 
choices while playing Same, plus the “utility” 
of choosing heads. 

Figure 4 shows the proportion of heads 
choices in each role (thin and dotted lines) 
and estimates of $p_S$ (thick gray bars) and $p_D$ 
(thick black bars) across games. From simply 
looking at the thick bars and comparing them 
to the terminal distribution of choices in each 
game, it is difficult to determine if the Nash 
equilibria correctly predicted where choices 
would converge; compare, for instance, data 
from pigeon A1 in Game 1, where the Nash 
equilibria fared well, with data from pigeon B2 
in Game 4, where they did not. In some games 
Fig. 4. Proportion of heads choices across MP games. Pigeon A1 played against B1, and pigeon A2 played against B2. 
Horizontal bars indicate expected proportion of heads choices according to a fitted Nash Equilibrium model 
(Equation 3). The © symbol represents the successful choice with larger reward (top = heads, bottom = tails). 
Deviations from horizontal bars towards © are indicative of own-payoffs effect. 


it is not even clear that choices stabilized after 
training (e.g., pigeon A2 in Game 4). In almost 
all cases, however, whether heads was chosen 
more often while playing one role or another 
coincided with predictions from the Nash 
equilibria. A more stringent assessment may 
be based on two important features of choices 
predicted by the Nash equilibria that are 
revealed by Equation 3. 

First, note the difference in subscripts 
between the left and right hand sides of 
Equation 3: they indicate that optimal choices 
are not determined by one’s own payoffs, but 
by the opponent’s. This implies that if payoffs 
are larger for tails relative to heads, it is the 
opponent who should adjust her choices, 
choosing heads less often if playing Same, or 
more often if playing Different. Here is an 
intuitive account of why this should happen: If 
pigeon A1 plays Different and its payoff for 
successful tails choices increases (as in Game 
2), it would be motivated to choose heads less 
often, but so would its opponent, B1, who 
is playing Same. This would discourage A1 from 
choosing heads less often than tails, but only 
while B1 makes it less likely that A1 would win 
if it chose tails. Thus, a change in A1’s payoffs 
should be reflected in B1’s behavior. Knowledge 
of the opponent’s payoff, which our 
pigeons could not have, should not be 
necessary; feedback via reinforcement should 
suffice: If A1 chooses tails more often, B1 
would increase its rate of reinforcement by 
choosing tails too, which would discourage A1 
from playing tails. Furthermore, unequal 
payoffs should affect the opponent’s choice 
in opposite directions depending on whether 
it is playing Same or Different — tracking the 
richer alternative in the former, avoiding it in 
the latter. 

The effect of payoff changes on an opponent’s 
choices expected from the Nash equilibrium 
is visible in the thick bars of Figure 4: 
Nash equilibria changed between games only 
when payoffs were changed for the opponent. 
In Game 2, when reinforcer duration for tails 
choices was increased for pigeons A1 and A2, 
the Nash equilibria predicted a decrease in $p_S$ 
and an increase in $p_D$ in pigeons B1 and B2. A 
similar change was predicted for pigeons A1 
and A2 in Game 3, when reinforcer duration 
for tails choices was increased for pigeons B1 
and B2. In Game 4, when reinforcer duration 
for heads choices was increased and for 
successful tails choices was decreased, $p_S$ was 
predicted to increase and $p_D$ to decrease 
relative to prior games. Actual choices roughly 
followed these predictions. In Game 2, choices 
diverged less across roles for pigeons that 
experienced unequal payoffs (A1 and A2) 
than for their opponents (B1 and B2); once 
unequal payoffs were presented to B1 and B2 
in Game 3, their opponents’ choices also 
diverged across roles. In Game 4, the simulta- 
neous reversal of reinforcer duration for both 
pigeons either reversed choices across roles 
(pigeons A1 and A2) or at least reduced their 
difference (pigeons B1 and B2). Thus, consis- 
tent with the Nash equilibria, changes in 
choices between games were sensitive to 
changes in opponent’s payoffs. Moreover, 
rapid changes in choice with changes in role 
demonstrated that game strategy was con- 
trolled by visual stimuli that cued role. 

The second feature of choice at the Nash 
equilibrium revealed by Equation 3 is that, 
because utilities did not change across roles 
within the same pigeon (our first constraint 
for estimating utilities), $p_S = 1 - p_D$ for any 
pigeon. This is visible in Figure 4: Nash 
equilibria for both roles within each game 
were always equidistant to .5 and thus added to 
1. Relative to these predictions, however, 
choices in each role at the end of game appear 
to be biased towards the richer alternative 
(“©” symbols in Figure 4) in most games with 
unequal payoffs; the opposite bias was never 
observed. The sensitivity of pigeons’ choices to 
their own payoffs, however intuitive, was not 
expected from the Nash equilibria. 

The Nash equilibria provided normative 
predictions about asymptotic choice distribu- 
tion in MP competitions. These predictions 
were partially validated by the pigeons’ behav- 
ior. Consistent with predictions, choices were 
sensitive to the payoffs of the opponent; 
however, in conflict with predictions, choices 
were also sensitive to each pigeon’s own 
payoffs. These molar characteristics of ob- 
served choice may now serve to evaluate the 
global output of models of trial-by-trial perfor- 
mance. 


Learning Models 

Learning processes may be characterized by 
algorithms that specify trial-by-trial changes in 
performance as a function of various inputs 
from the organism and its environment 
(Sutton & Barto, 1998). An algorithm that 
accurately represents the learning process of 
pigeons in MP must be consistent with the data 
presented here: It must approximate the Nash 
equilibrium and be sensitive to role cues, but it 
must also display choice perseveration and 
oversensitivity to one’s own payoffs. We con- 
sidered two candidate algorithms to describe 
MP performance: A Stochastic Linear Learning 
model that incorporates choice perseveration 
(SLLp), and a Weighted Reinforcement Matching 
(WRM) model. Choice perseveration alone 
(EWMA, Equation 1) served as a default 
nonlearning algorithm against which learning 
models were evaluated. 

Stochastic Linear Learning Model with 
Perseveration (SLLp) 

Stochastic linear learning (SLL) models 
assume that learning agents maintain a prob- 
ability distribution of choices: If a choice is 
reinforced, the probability of that choice 
increases; if a choice is not reinforced, the 
probability of that choice either decreases or 
remains constant. This algorithm has been 
used often as the backbone of models of 
conditioning (Bush & Mosteller, 1951; Re- 
scorla & Wagner, 1972), and has also provided 
accurate descriptions of human and nonhu- 
man primate learning of mixed strategies 
(Erev & Roth, 1998; Lee et al., 2004, 2005; 
Mookherjee & Sopher, 1994). In the only 
report of an MP game between laboratory 
animals, however, neither the Nash equilibrium 
nor an SLL model captured the performance 
of pairs of rats (Flood et al., 1983). This 
result may have been due to poorly discrimi- 
nable rewards and strong biases generated by 
visual contact between opponents. Neverthe- 
less, the demonstrable operation of SLL 
mechanisms in nonhuman learning suggests 
that a modified version of SLL may provide a 
good account of nonhuman performance in 
game environments. 

For any given alternative, the SLL model 
may be formulated as: 

$$p_{t+1} = p_t + f(C_t, \pi_t). \qquad (4)$$ 
Table 3 

Probability change function in Eq. 4 for choosing heads. 

                                Payoff (π_t)
Prior Choice (C_t)       0                   > 0
Heads                    λ_E (0 - p_t)       λ_R (1 - p_t)
Tails                    λ_E (1 - p_t)       λ_R (0 - p_t)

Equation 4 simply states that a change in the probability of choosing an
alternative from one trial to the next is a function of the last choice
made ($C_t$) and the payoff for that choice ($\pi_t$). Table 3 specifies the
change function $f(C_t, \pi_t)$ evaluated in this paper. The two-way table
assumes that $p_t$ is the probability of choosing heads; an analogous table
may be constructed for choosing tails by flipping the top and bottom cells.
Two free parameters modulate the impact of a trial on the probability of
choosing an alternative: Rate of extinction $\lambda_E$, which operates when
reinforcement is absent ($\pi_t = 0$), and rate of learning $\lambda_R$, which
operates when reinforcement is present ($\pi_t > 0$). The left cells of
Table 3 indicate that the probability of choosing an alternative, heads in
this case, should decrease after each trial by a factor of $\lambda_E$ if that
choice is not reinforced, and increase by the same factor if the opposite
choice is not reinforced. The right cells of Table 3 indicate that the
probability of choosing an alternative should increase after each trial by
a factor of $\lambda_R$ if that choice is reinforced, and decrease by the same
factor if the opposite choice is reinforced.
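
The change function of Table 3 can be sketched in a few lines of Python. This is an illustration and not the authors' code: the function names, the argument names `lam_e` and `lam_r` (rates of extinction and learning), and the coding of the prior choice as a Boolean are assumptions made here.

```python
def change_in_p_heads(p, chose_heads, payoff, lam_e, lam_r):
    """Change in the probability of choosing heads, following Table 3.

    p           -- current probability of choosing heads
    chose_heads -- True if the prior choice was heads, False if tails
    payoff      -- payoff obtained on the prior trial (0 = nonreinforced)
    lam_e       -- rate of extinction (illustrative name)
    lam_r       -- rate of learning (illustrative name)
    """
    if payoff > 0:
        # Right cells: reinforcement pulls p toward the reinforced choice.
        target = 1.0 if chose_heads else 0.0
        return lam_r * (target - p)
    # Left cells: nonreinforcement pulls p away from the choice just made.
    target = 0.0 if chose_heads else 1.0
    return lam_e * (target - p)


def sll_update(p, chose_heads, payoff, lam_e, lam_r):
    """Equation 4: probability of choosing heads on the next trial."""
    return p + change_in_p_heads(p, chose_heads, payoff, lam_e, lam_r)
```

Because every cell moves the probability a fraction of the way toward 0 or 1, the updated value stays between 0 and 1 whenever both rates lie between 0 and 1.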

Whereas it may suffice to use a single parameter across all games to define
rate of extinction $\lambda_E$, the use of two different positive payoffs,
small and large, implies a potential use of two free parameters to define
rate of learning $\lambda_R$. We relied on prior empirical research to
minimize the number of free parameters: rate of learning $\lambda_R$ was
specified as a function of immediately preceding positive payoffs
(Borgers & Sarin, 1997), using Killeen’s (1985) value function for food
duration in animals,

$$\lambda_R = 1 - e^{-\gamma \pi_t}. \qquad (5)$$

Parameter $\gamma$ is the curvature of the utility function for food
rewards; higher values of $\gamma$ yield more concave functions.
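
To illustrate how curvature modulates the effect of payoff size, the sketch below assumes that Equation 5 has the exponential form 1 - exp(-gamma * payoff). The function name and the payoff values (2 and 6, in arbitrary units) are hypothetical and are not taken from the experiment.

```python
import math


def learning_rate(payoff, gamma):
    """Rate of learning as a concave function of the preceding payoff.

    Assumes the exponential form 1 - exp(-gamma * payoff); with gamma = 0
    the rate is 0 for any payoff.
    """
    return 1.0 - math.exp(-gamma * payoff)


# Hypothetical small and large payoffs (arbitrary units, not from the study).
small, large = 2.0, 6.0
for gamma in (0.05, 0.5):
    ratio = learning_rate(large, gamma) / learning_rate(small, gamma)
    print(f"gamma={gamma}: large/small learning-rate ratio = {ratio:.2f}")
```

Under this assumed form, the ratio of learning rates approaches the ratio of payoffs (3 in this example) as gamma approaches 0, and approaches 1 as gamma grows, which is one way to read "more concave."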


The SLL model is completed with the incorporation of choice perseveration
into Equation 4, forming what will be called SLLp. Perseveration was
incorporated by substituting $p_t$ in Equation 4 and Table 3 with
$p'_{t+1}$ from EWMA (Equation 1). Thus, in determining choice, the model
assumes that perseveration operates first, transforming the just-prior
estimate of choice probability $p'_t = p_t$ into $p'_{t+1}$ following EWMA
(Equation 1); reinforcement then adds to or subtracts from the
momentum-driven propensity following Equations 4 and 5 and Table 3.

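The nesting of EWMA into SLLp yields two equations that represent the
change in probability of choosing an alternative following reinforced and
nonreinforced trials, respectively,

$$\text{If } \pi_t > 0,\quad p_{t+1} = \alpha_L C_t + (1 - \alpha_L)\left[p_t + \left(1 - e^{-\gamma \pi_t}\right)(C_t - p_t)\right], \qquad (6a)$$

$$\text{If } \pi_t = 0,\quad p_{t+1} = \lambda_E (1 - C_t) + (1 - \lambda_E)\left[\alpha_L C_t + (1 - \alpha_L) p_t\right]. \qquad (6b)$$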

Whereas EWMA has a single free parameter, $\alpha$, the SLLp model has
three free parameters: Persistence ($\alpha_L$; the subscript distinguishes
it from EWMA’s parameter), concavity of utility function ($\gamma$), and
rate of extinction ($\lambda_E$). Note that Equation 6a can be reduced to
EWMA by setting $\gamma = 0$, and Equation 6b by setting $\lambda_E = 0$.
That is, if choice is insensitive to reinforcement and extinction, only
persistence operates. Equation 6a implies that persistence and
reinforcement determine choice following a reinforced trial: The first
term on the right-hand side indicates that as $\alpha_L \to 1$ (strong
persistence), choice depends more on the just-prior choice; the second
term indicates that as $\alpha_L \to 0$, choice changes in the direction of
the reinforced choice at a rate determined by the utility of the
reinforcer. Similarly, Equation 6b implies that extinction and persistence
determine choice following a nonreinforced trial: The first term on the
right-hand side indicates that as $\lambda_E \to 1$ (high sensitivity to
nonreinforcement), choice alternation is more likely; the second term
indicates that as $\lambda_E \to 0$, choice is determined more by its own
momentum.
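
A single SLLp step can be sketched as perseveration followed by reinforcement or extinction. The sketch is illustrative only: it assumes choices coded 1 (heads) or 0 (tails), an EWMA of the form alpha * choice + (1 - alpha) * p for Equation 1, and the exponential form of Equation 5; all function and argument names are introduced here.

```python
import math


def ewma(p, choice, alpha):
    """Perseveration step: pull p toward the choice just made (1 = heads, 0 = tails)."""
    return alpha * choice + (1.0 - alpha) * p


def sllp_update(p, choice, payoff, alpha_l, gamma, lam_e):
    """One SLLp step for the probability of choosing heads."""
    p_prime = ewma(p, choice, alpha_l)            # perseveration operates first
    if payoff > 0:
        lam_r = 1.0 - math.exp(-gamma * payoff)   # assumed form of Equation 5
        # Reinforcement moves the propensity toward the reinforced choice.
        return p_prime + lam_r * (choice - p_prime)
    # Nonreinforcement moves the propensity toward the alternative not chosen.
    return lam_e * (1 - choice) + (1.0 - lam_e) * p_prime
```

Setting `gamma=0` on a reinforced trial, or `lam_e=0` on a nonreinforced trial, returns the EWMA value unchanged, which mirrors the reduction to EWMA described above.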

To account for stimulus control, the model assumed separate values of
$p_t$ for each role. When roles changed between $t$ and $t+1$, $p_{t+1}$
was predicted by the last $p'$ estimated for the new role. However, the
values of $C_t$ and $\pi_t$ used to predict $p_{t+1}$ were those of the
preceding trial, regardless of role.

The SLLp model was evaluated against 
EWMA (Equation 1). As had been done 
already with EWMA, SLLp’s free parameters 
were estimated by fitting predicted choice 
probabilities to actual choices, using the 
method of maximum likelihood. The relative 
merit of each model was determined by log- 
likelihood ratios (LLRs). Log-likelihood ratios 
were computed for each pigeon as the log of 
the ratio of the likelihood of SLLp over 
the likelihood of EWMA (cf. footnote 1). 
Parameter estimates and LLRs are shown in 
Table 2. According to LLRs, the data were
substantially more probable given
SLLp than given EWMA. This difference
cannot be accounted for only by the difference 
in the number of free parameters (one in 
EWMA, three in SLLp). On the basis of the 
Akaike Information Criterion (Burnham & 
Anderson, 2002), two additional free parame- 
ters would cast doubt on inferences based on 
LLRs of about 4, but not on 3-digit LLRs. 
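
The comparison logic can be sketched as follows. The choices and predicted probabilities below are made up for illustration, natural logarithms are assumed (the definition in footnote 1 is not reproduced in this section), and the function names are introduced here.

```python
import math


def log_likelihood(choices, probs):
    """Sum of log-probabilities of binary choices (1 = heads) given predicted P(heads)."""
    return sum(math.log(p if c == 1 else 1.0 - p) for c, p in zip(choices, probs))


def aic(log_lik, n_params):
    """Akaike Information Criterion; lower values indicate better support."""
    return 2 * n_params - 2 * log_lik


# Hypothetical choices and model predictions, for illustration only.
choices = [1, 1, 0, 1, 0, 0, 1, 1]
p_simple = [0.5] * len(choices)                     # stand-in for a 1-parameter model
p_rich = [0.7, 0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.8]   # stand-in for a 3-parameter model

ll_simple = log_likelihood(choices, p_simple)
ll_rich = log_likelihood(choices, p_rich)
llr = ll_rich - ll_simple                           # log of the likelihood ratio
delta_aic = aic(ll_simple, 1) - aic(ll_rich, 3)     # equals 2 * llr - 4
```

With natural logarithms, two extra parameters subtract a fixed 4 units from twice the LLR, so the penalty matters only when the LLR is small.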

It should not be surprising that SLLp
predicted data better than EWMA. The large
LLRs indicate that choices in MP are not
driven by perseveration alone, and that
SLLp captures at least part of the unaccounted 
variance by attributing it to an obvious suspect, 
reinforcement. This does not mean that 
perseveration is unimportant; in fact, the 
incorporation of reinforcement effects re- 
vealed stronger choice perseveration than 
EWMA estimated for most pigeons. Nonethe- 
less, to evaluate SLLp’s account of reinforce- 
ment in MP it is necessary to compare it 
against another dynamic model of choice. 

Weighted Reinforcement Matching (WRM) 

A well established regularity in choice 
behavior is that the proportion of choices of 
an alternative matches the proportion of 
reinforcement obtained from that alternative 
(Herrnstein, 1961); this regularity is known as 
the matching law (Herrnstein, 1974). This is a 
global aspect of behavior for which local 
mechanisms are still uncertain (Lefebvre & 
Sanabria, 2008; MacDonall, 1999). Weighted 
Reinforcement Matching (WRM) is proposed 
here as a local mechanism that, consistent with 


melioration theory (Herrnstein & Vaughan, 
1980), assumes that the matching law operates 
locally. More specifically, WRM postulates that, 
when choosing between alternatives, the pro- 
portion of recent payoffs obtained from each 
alternative is compared against the proportion 
of recent opportunities in which that alternative
was chosen. If the former were larger than
the latter, the alternative would likely be
chosen; if the former were smaller than the
latter, the alternative would likely not be
chosen. This implies a significant contrast with
the low cognitive sophistication of SLLp: 
Whereas the mnemonic demands of SLLp 
are restricted to the agent’s own behavior (i.e., 
the propensity towards each choice), local 
matching also demands separate counts of 
rewards obtained from each alternative. 

Although choice distribution generally cov- 
aries with reward distribution, there are two 
well known systematic deviations from strict 
matching: Bias and sensitivity to reward size 
(Baum, 1974). Bias towards an alternative is 
the reward-independent tendency to choose 
that alternative; a high/low sensitivity to 
reward size means that larger rewards are 
chosen more/less than expected from strict 
matching (Alsop & Porritt, 2006). For binary
choices, these deviations are accounted for by 
the general form of the matching law (Baum, 
1974; Davison & McCarthy, 1988), 


$$p_A = \frac{\beta_A R_A^{\,s}}{\beta_A R_A^{\,s} + R_B^{\,s}}, \qquad (7)$$

where $p_A$ is the long-term probability of choosing alternative A instead
of B, $R_A$ is the cumulative reinforcement obtained from choosing A
divided by the total number of trials, $\beta_A$ is bias towards choosing A,
and $s$ is sensitivity to reward size expressed as a power function.²

The probability $p$ of choosing an alternative may be dynamically updated
by substituting $R$ with an EWMA of reinforcement of that choice,

$$R_{t+1} = \alpha_M \pi_t C_t + (1 - \alpha_M) R_t. \qquad (8)$$


²Equation 7 is mathematically equivalent to the more conventional ratio
expression

$$\frac{p_A}{p_B} = \beta_A \left(\frac{R_A}{R_B}\right)^{s}, \qquad (F2)$$

where the left-hand side of the equation is the ratio of choices.
According to Equation 8, if a choice is not reinforced ($\pi_t = 0$) or not
made ($C_t = 0$), average reinforcement of that choice declines at a rate
of $1 - \alpha_M$. If $\alpha_M = 0$, $R_t$ is constant and equivalent to
$R_A$ in Equation 7; values of $\alpha_M$ closer to 1 weigh recent payoffs
more heavily in updating average reinforcement. Parameter $\alpha_M$ may be
regarded as the rate of persistence of memory for past choices and
reinforcement: as memory of old items fades, their weight in the
computation of average reinforcement is reduced. The WRM model
incorporates this memory into the matching function of Equation 7, thus
involving three free parameters: memory persistence ($\alpha_M$),
sensitivity to reward size ($s$), and bias toward choosing heads
($\beta_{heads}$, $\beta_{tails} = 1/\beta_{heads}$).

To implement WRM, Equation 8 was computed for each alternative, on each
trial; $R_{t+1}$ for heads and tails were entered as $R_A$ and $R_B$,
respectively, in Equation 7. A separate register of $R_t$ was kept for each
role such that, if roles changed between $t$ and $t+1$, $R_{t+1}$ depended
on the last $R$ computed for the new role. However, the values of $C_t$ and
$\pi_t$ used to compute $R_{t+1}$ were those of the preceding trial,
regardless of role.
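
The two steps of WRM (updating average reinforcement, then applying the matching function) can be sketched as below. The sketch assumes that Equation 8 is an EWMA of payoff-weighted choices and that Equation 7 places the bias on the heads term only; the function names, the argument names, and the handling of an empty reinforcement history are assumptions of this illustration. Role-specific registers are omitted.

```python
def wrm_step(r_heads, r_tails, choice, payoff, alpha_m):
    """Update average reinforcement for each alternative (assumed form of Equation 8).

    choice is 1 for heads and 0 for tails; payoff is the payoff of that choice.
    """
    c_heads, c_tails = choice, 1 - choice
    r_heads = alpha_m * payoff * c_heads + (1.0 - alpha_m) * r_heads
    r_tails = alpha_m * payoff * c_tails + (1.0 - alpha_m) * r_tails
    return r_heads, r_tails


def p_heads(r_heads, r_tails, beta_heads, s):
    """Generalized matching (assumed form of Equation 7) applied to the averages."""
    num = beta_heads * r_heads ** s
    den = num + r_tails ** s
    # With no reinforcement history the ratio is undefined; indifference is assumed.
    return num / den if den > 0 else 0.5
```

On each trial, `wrm_step` would be called with the preceding choice and payoff, and `p_heads` would then give the predicted probability of choosing heads.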

Free parameters were estimated using the maximum likelihood method.
According to these estimates (shown in Table 2), all pigeons were
relatively insensitive to reward size ($s < 1$), and most of them were
biased toward choosing heads ($\beta_{heads} > 1$). This account was more
likely than EWMA to yield the observed data, as shown by LLRs that ranged
between 75.3 and 135. These large LLRs preclude the possibility that the
advantage of WRM over EWMA was due to additional free parameters.

SLLp versus WRM 

Despite the superior performance of WRM 
relative to EWMA, it did not surpass SLLp in 
predicting the obtained data. Log-likelihood 
ratios were larger for SLLp than for WRM for 
every pigeon. Differences between LLRs, 
which evaluate each learning model against 
EWMA, are in fact the LLRs between SLLp 
and WRM. These differences ranged from 4.5 
(pigeon B2) to 130.4 (pigeon A2), favoring 
SLLp. The superiority of SLLp over WRM 
could not depend on the number of free 
parameters because there were three free 
parameters for both models. Thus, SLLp 


provided a better account of MP performance 
than WRM did.
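
The step from differences between LLRs to the LLR between the two learning models follows from the properties of logarithms. Writing $\mathcal{L}$ for the likelihood of the data under each model (notation introduced here for illustration):

```latex
\begin{align*}
\mathrm{LLR}_{\mathrm{SLLp}} - \mathrm{LLR}_{\mathrm{WRM}}
  &= \log\frac{\mathcal{L}_{\mathrm{SLLp}}}{\mathcal{L}_{\mathrm{EWMA}}}
   - \log\frac{\mathcal{L}_{\mathrm{WRM}}}{\mathcal{L}_{\mathrm{EWMA}}} \\
  &= \log\frac{\mathcal{L}_{\mathrm{SLLp}}}{\mathcal{L}_{\mathrm{WRM}}}.
\end{align*}
```

The likelihood of EWMA cancels, so the difference is a direct comparison of SLLp against WRM.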

The superiority of SLLp relative to WRM 
suggests that choices probably did not depend 
on separate counts of rate of reinforcement 
for each alternative action. Instead, SLLp 
implies a simpler behavioral machinery, a 
“stream” of choices that carried its own 
momentum and that was directed by reinforcement.
But could such a simple mechanism
yield the global patterns of behavior
depicted in Fig. 4? Would its output be
sensitive to Nash equilibrium in each role?
Would it produce own-payoffs biases? To
answer these questions, we resorted to
simulations of SLLp.

Simulation of SLLp 

We attempted to reproduce the observed performance of pigeons by feeding
an SLLp simulator with the best fitting parameters (Table 2). The
simulator was run 1000 times on the same sequence of games experienced by
the pigeons. Results are shown in Figure 5. Like that of the actual
pigeons, the performance of simulated pigeons that learned using SLLp
approached the Nash equilibria (note the separation of the lines in SimB1
and SimB2 during Game 2) and was oversensitive to the simulated pigeons’
own payoffs (note changes in performance from Game 1 to Game 2 in SimA1
and SimA2, and from Game 2 to Game 3 in SimB1 and SimB2).

Despite the approximate reproduction of 
the pigeons’ performances, SLLp simulations 
showed two significant departures from the 
empirical data. First, pigeons did not acquire 
asymptotic performance as fast as the mean 
SLLp simulation performance suggests. Sec- 
ond, the asymptotic sequential variance of 
choice proportions across experimental ses- 
sions was systematically lower than in its 
simulated counterpart (the seemingly stable 
curves shown in Figure 5 are due to the 
averaging of 1000 simulator runs). Both 
inconsistencies point in the same direction: 
SLLp parameters are mostly dependent on fast 
within-session changes in choice, and thus 
misrepresent substantially slower between-ses- 
sion changes. By assuming that fast changes in 
choice carried over from one session to the 
next, the simulations yielded steeper acquisi- 
tion curves that were less stable at asymptote 
than those observed in pigeons. 
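
A much-reduced sketch of such a simulator is given below. It is not the simulator used in the study: it assumes a single fixed payoff for both players, fixed roles (one player wins when choices match, the other when they differ), no session boundaries, and hypothetical parameter values; the update rule repeats the assumed forms of Equations 5, 6a, and 6b.

```python
import math
import random


def sllp_update(p, choice, payoff, alpha_l, gamma, lam_e):
    """One SLLp step for P(heads); choice is 1 (heads) or 0 (tails)."""
    p_prime = alpha_l * choice + (1.0 - alpha_l) * p
    if payoff > 0:
        lam_r = 1.0 - math.exp(-gamma * payoff)   # assumed form of Equation 5
        return p_prime + lam_r * (choice - p_prime)
    return lam_e * (1 - choice) + (1.0 - lam_e) * p_prime


def simulate(n_trials, params_same, params_diff, payoff=1.0, seed=0):
    """Return the proportion of heads chosen by the Same and Different players."""
    rng = random.Random(seed)
    p_same = p_diff = 0.5                         # both players start indifferent
    heads_same = heads_diff = 0
    for _ in range(n_trials):
        c_same = 1 if rng.random() < p_same else 0
        c_diff = 1 if rng.random() < p_diff else 0
        match = c_same == c_diff
        pay_same = payoff if match else 0.0       # Same wins on matching choices
        pay_diff = 0.0 if match else payoff       # Different wins otherwise
        p_same = sllp_update(p_same, c_same, pay_same, *params_same)
        p_diff = sllp_update(p_diff, c_diff, pay_diff, *params_diff)
        heads_same += c_same
        heads_diff += c_diff
    return heads_same / n_trials, heads_diff / n_trials


# Hypothetical parameters (alpha_l, gamma, lam_e); not the estimates in Table 2.
params = (0.1, 0.2, 0.05)
runs = [simulate(500, params, params, seed=k) for k in range(20)]
mean_same = sum(r[0] for r in runs) / len(runs)
mean_diff = sum(r[1] for r in runs) / len(runs)
```

Averaging over runs, as done here with 20 seeds, smooths trial-to-trial variability in the same way that averaging 1000 runs does in Figure 5.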




[Figure 5 panels. Vertical-axis ticks: 0.25, 0.50, 0.75; horizontal axis: Sessions.]


Fig. 5. Mean performance of 1000 simulations of SLLp, with parameters extracted from pigeons’ performance. The 
simulations reproduce reversals in preference across roles predicted by Nash equilibrium, as well as own-payoff effects. 
Notation is as in Figure 4. 


Differences in within- and between-session 
learning rates are regularly observed in labo- 
ratory animals in various experimental prepa- 
rations, and are best illustrated by the phe- 
nomenon of spontaneous recovery (Bouton, 
2002): Animals learn to stop responding when 
reinforcement is discontinued, but responding 
is likely to recover spontaneously after an 
interruption of experimental conditions, such 
as the interval between experimental sessions. 
Important associative processes appear to 
underlie spontaneous recovery (Robbins, 
1990). The net result is that, at the beginning 
of each experimental session, learning does 
not pick up where it left off at the end of the 
last session, but is somewhat backtracked. 

Choice acquisition by pigeons also appeared to backtrack at the beginning
of each MP session. This effect is illustrated by the performance of
pigeon B2 in Game 4, while playing Same (dotted curve in the bottom-right
panel of Figure 4). In Game 4, B2 transitioned from choosing heads with
p = .35 to p = .78. Backtracking is revealed when p is plotted for the
first and second half of each session, as in Figure 6. Before transition
(first four sessions), p repeatedly declined between the first and second
half of each session, suggesting a fast within-session learning process
that backtracked between sessions. During and after transition, downward
within-session trends in p became rare; substantial within-session changes
in p were typically upward, suggesting a reversal in the within-session
learning process, often interfered with by prior “downward” learning. The
reversal of within-session trends is more clearly depicted in the inset of
Figure 6: Downward trends in p during the first four sessions (negative
change) were followed by a mixture of upward (positive) and flat trends in
subsequent sessions. Similar backtracking was also visible in other
pigeons while substantial changes in choice were in progress (Pigeon A1,
Game 4; Pigeon B1, Game 2, playing Same; Pigeon A2, Game 4, playing Same).
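
The half-session comparison described above amounts to a simple computation, sketched here with a made-up session of binary choices (1 = heads). The function name and the data are hypothetical.

```python
def half_session_proportions(choices):
    """Proportion of heads (coded 1) in the first and second half of a session."""
    mid = len(choices) // 2
    first, second = choices[:mid], choices[mid:]
    return sum(first) / len(first), sum(second) / len(second)


# Hypothetical session of binary choices, for illustration only.
session = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
p_first, p_second = half_session_proportions(session)
change = p_second - p_first   # negative values indicate a downward within-session trend
```

Plotting `change` for each session would give the kind of display described for the inset of Figure 6.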

Within-session changes in choice around 
transitions suggest that a more precise account 
Fig. 6. Proportion of heads choices by pigeon B2 in
Game 4 while playing Same, in the first and second half of 
each session (open and closed circles, respectively). The 
general trend in change of choice over the game is 
depicted by a continuous sigmoidal function, y = yo + (« “ 
^o) / (1 + « ~ ~ ^ where yo. h and cwere fitted to 

choice proportions. The inset shows the difference in 
choice proportions between the second and first half of 
each session. 

of multisession game behavior would be 
attained if the processes underlying spontaneous
recovery were incorporated into the SLLp
model. The specifications of this module
could be laid out in future developments of 
SLLp. 

DISCUSSION 

Like humans (Rapoport & Budescu, 1992) and other primates (Lee et al.,
2004), pigeons can learn to compete efficiently in MP games. Such
efficiency, however, is somewhat abated by a slight yet consistent
tendency to persist in prior choices and an excessive responsiveness to
their own payoffs. Choice persistence in pigeons contrasts with
overalternation in adult humans playing mixed strategy games (Brown &
Rosenthal, 1990). Response persistence, however, is not an isolated
phenomenon: It has been reported in other species (Lee et al., 2004) and
in other experimental contexts (Baum & Davison, 2004; Killeen, 2003;
Reboreda & Kacelnik, 1993; but see Dember & Fowler, 1958). Moreover,
persistence cannot be reduced to the automatic repetition of a motoric
response because, in the experiment reported here, choices were separated
by intertrial intervals of 10 s or more. That pigeons diverge from
optimality in the direction opposite to that of humans suggests that
overalternation depends on cognitive idiosyncrasies of Homo sapiens;
perhaps it is a verbally acquired misrepresentation of randomness, thus
absent in preverbal children and nonhumans. This hypothesis has yet to be
evaluated.

Suboptimal sensitivity to one’s own payoffs 
in mixed strategy games has been observed in
human players (Binmore, Swierzbinski, & 
Proulx, 2001; Ochs, 1995). Prior to this 
experiment, however, the manipulation of 
payoffs necessary to evaluate this effect had 
not been conducted in nonhuman subjects. 
Our results indicate an important invariance 
across species that further research should 
confirm; it implies that accounts of own- 
payoffs effects (Goeree, Holt, & Palfrey, 
2003) may be modeling basic behavioral 
processes that humans share with other 
species. 

A stochastic linear learning (SLL) algo- 
rithm, which has successfully described human 
performance in mixed strategy games (Erev & 
Roth, 1998; Mookherjee & Sopher, 1994), 
approximated optimal performance and re- 
produced own-payoffs effects. With the addi- 
tion of a persistence module, this algorithm 
(SLLp) described pigeon performance better 
than the more sophisticated reward-tracking
WRM model. This result, however, does not 
rule out reward tracking in other species, or 
other forms of reward tracking different from 
dynamic matching. Research in this direction 
has been conducted by Lau and Glimcher
(2005) and Corrado, Sugrue, Seung, and 
Newsome (2005). Their studies suggest re- 
ward-tracking mechanisms that account for 
the performance of rhesus monkeys in con- 
current schedules of reinforcement without 
relying on dynamic matching. Ideally, learning 
models should be developed to account for 
performance across schedules of reinforce- 
ment and, to the extent that it is possible, 
across species. This study and those by Lau and
Glimcher and by Corrado and colleagues are
formulating some candidate models. The
challenge, now, is to identify common ground 
and devise critical tests. 

An important implication of SLLp is that it
does not require players to learn anything
about their opponent, but only about their
own outcomes. The approximation of pigeons’
choices to the Nash Equilibrium did not (and
could not) depend on them “knowing” their
opponent’s payoffs, as is presumed of
game-theoretical agents. Similarly, the formation of
evolutionarily stable strategies does not de- 
pend on a population “knowing” the fitness 
payoffs of another population (Maynard 
Smith, 1974). Whether over trials or over 
generations, a rudimentary feedback loop 
suffices for the formation of stable stochastic 
strategies. Choices in mixed strategy games 
thus appear to be susceptible to the often 
invoked parallelism between instrumental 
learning and natural selection (cf. Skinner, 
1981). 

It is possible that sophisticated cognitive and 
social processes, such as hypothesis testing and 
considerations of fairness, play a role in mixed 
strategy game playing, particularly in humans. 
Before such processes are invoked, they must be shown to account
for variance that is not accounted for by more
parsimonious learning mechanisms, such as 
those described here. Critical tests for the 
involvement of complex processes are, thus, 
defined in part by how simpler processes 
operate in game-theoretical scenarios. Because 
these simpler processes are more efficiently 
studied using animal models, we hope to see 
the field of behavioral game theory extend its 
domain into the acquired behavior of nonhu- 
man species. 